Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots by Copilot · Pull Request #77 · Simn/haxe

Copilot · 2026-03-15T08:08:04Z

Profiling setup and analysis for the Haxe compiler eval path and HXB roundtrip, identifying GC pressure, type-traversal hotspots, and zip I/O as the primary cost centres. Includes a concrete performance fix for HXB zip I/O, a replacement of Domainslib with a custom WorkerPool, a comprehensive profiling guide, and a corrected analysis of polymorphic comparison sites verified via OCaml assembly inspection.

Infrastructure changes

src/dune: adds a profile build environment with (ocamlopt_flags (:standard -O2 -g -unbox-closures)) so a DWARF-instrumented binary can be produced without touching the release build:
```
dune build --profile profile src/haxe.exe
```
The resulting binary is accepted by perf record --call-graph dwarf and resolves OCaml symbols correctly.

Code changes

src/compiler/generate.ml: switched HXB zip compression from level 6 (Deflated) to level 0 (Stored). This eliminates all zlib compression/decompression overhead:
- Read-phase hxblib get bytes: 0.090 s → 0.021 s (4.3× faster)
- Archive size: 3.6 MB → 6.5 MB (1.8× larger — favorable tradeoff)
src/context/parallel.ml: replaced Domainslib with a custom WorkerPool using Domain.spawn + Condition.wait. Workers sleep (zero CPU) between parallel_for calls instead of spin-waiting. Key design points:
- Workers are spawned once at pool creation and block on Condition.wait when idle
- Work is distributed via contiguous chunk partitioning across N domains (main + N-1 workers)
- Nested parallel_for calls (e.g. analyzer iterating types → fields) detected via atomic busy flag and fall back to sequential, preventing state corruption
- Exception propagation with backtrace from worker domains
- ManagedPool retained as a thin lazy wrapper matching the old lifecycle
src/compiler/server/serverCompilationContext.ml: updated ManagedPool.create to use the new argument-free constructor.
src/dune and haxe.opam: removed domainslib dependency.

Documentation

extra/PROFILING.md: comprehensive profiling guide covering how to use perf, olly gc-stats, and --times flags, with full results, recommendations, and documentation of the WorkerPool architecture.

How to profile

# lower perf paranoia (requires sudo)
sudo sysctl -w kernel.perf_event_paranoid=-1

# record eval unit tests
perf record -F 999 -g --call-graph dwarf -o eval.perf \
  haxe --cwd tests/unit compile-macro.hxml

# built-in timer breakdown (most informative for day-to-day work)
haxe --cwd tests/unit compile-macro.hxml \
  --times -D times.eval -D times.filter=2 -D times.analyzer=2

# OCaml 5 GC profiling
olly gc-stats -- haxe --cwd tests/unit compile-macro.hxml

Findings

Eval unit tests (`--times` + `perf` + `olly`, ~2.7 s)

Source	Cost
GC (`do_some_marking` + alloc/promote/sweep)	~26 % of CPU (`olly` confirmed)
Macro execution (utest `ancestorHasInitializeUtest` alone)	28 %
Typing	26 %
Filters (`handle_abstract_casts`, `fix_return_dynamic_from_void_function`)	13 %
Parsing	12 %
Analyzer (fusion dominates)	7 %
`compare_val` (polymorphic `=`/`compare`)	3.1 %

`compare_val` analysis (3.1 % of perf time) — verified via assembly inspection

OCaml's native compiler specializes = for types with all-constant constructors and for comparisons against literal constant constructors. Verified by compiling test files with ocamlfind ocamlopt -S and inspecting the generated assembly.

Truly polymorphic (calls caml_equal):

typeloadCheck.ml:187 — field_kind = field_kind (has Var of var_kind)
optimizerTexpr.ml:189 — tconstant = tconstant (has TInt of int32, TString of string, etc.)
fields.ml:141,194 — e.eexpr = TConst TSuper (structured literal, but shallow comparison)
Path comparisons (~10 sites) — string list * string
nullSafety.ml — polymorphic Hashtbl with safety_subject keys (also contributes to caml_hash 2.4%)

NOT polymorphic (specialized by OCaml, corrected from initial analysis):

typeloadCheck.ml:66 — m1 = MethDynamic: method_kind all-constant constructors → direct cmpq
typeloadCheck.ml:432 — mkind m1 = mkind m2: projects to int → direct cmpq
fields.ml:41 — !(a.a_status) = Const: literal constant constructor → direct cmpq

HXB roundtrip

Write phase: +0.38 s over normal eval (13 %); HxbWriter.write_type_instance and Zlib.update_crc visible in perf.
Read phase: ~0.85 s — 3× faster than a full compile. After switching to Stored compression, hxblib get bytes dropped from 0.090 s to 0.021 s.

GC overhead (`olly gc-stats`)

Eval: 26 % of CPU, 30.7 % on main domain
Roundtrip: 43 % of CPU (write phase creates many temporary buffers)
P99 GC latency: 3.0 ms, max: 6.0 ms

WorkerPool benchmark (4-core CI runner)

Benchmark	Domainslib	WorkerPool	Delta
Eval unit tests (median of 5)	2762 ms	2809 ms	+1.7 % (noise)
JVM compilation (median of 3)	1062 ms	1070 ms	+0.8 % (noise)

Throughput is within measurement noise. The primary benefit is eliminating idle CPU consumption: Domainslib workers spin-wait even when no work is available, whereas WorkerPool workers block on Condition.wait and consume zero CPU between parallel sections.

Remaining improvement opportunities

compare_val — replace polymorphic (=)/compare with typed equality in the confirmed polymorphic sites (field_kind, tconstant, path comparisons). Switch nullSafety.ml to functorized hash tables with custom hash/equal for safety_subject.

💬 Send tasks to Copilot coding agent from Slack and Teams to turn conversations into code. Copilot posts an update in your thread when it's finished.

…-times data Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Simn · 2026-03-15T09:07:49Z

GC pressure — dominant cost. Allocations in Texpr.map_expr, TFunctions.follow, and Stdlib.List.map are the primary drivers. Profile with memtrace to locate top allocators; consider in-place traversal patterns.

@copilot This is usually fine in OCaml due to its generational GC, so you might be jumping to conclusions too quickly here. But memtrace sounds like a good plan, go ahead and do that!

compare_val (3.3 %) — replace polymorphic (=)/compare with typed equality in hot typing/filter loops.

Indeed, please identify the places where this occurs.

HXB zip decompression — storing entries as Stored (uncompressed) eliminates decompression cost entirely at the expense of archive size; alternatively, pre-decompress and cache per-module bytes after first access.

Agreed, let's store without compression.

HxbWriter.write_type_instance — more aggressive type-reference deduplication (interned index) would shrink the archive and reduce write time.

I'm not sure what you mean here, please implement it so I can take a look!

Domainslib.Multi_channel spin-wait — worker domains spin even for small jobs; add a module-count threshold before spawning domains.

I don't have a good solution for this because the Domainslib API only allows fixed domain pools (I think). Can you think of a better approach for managing domains?

Also, please be aware that your opam installation got stuck waiting for a y input, so make sure OPAMYES=1 is set when running it.

…ith perf/olly/memtrace findings Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Copilot · 2026-03-15T09:20:27Z

GC pressure — dominant cost. Allocations in Texpr.map_expr, TFunctions.follow, and Stdlib.List.map are the primary drivers. Profile with memtrace to locate top allocators; consider i...

Addressed in 930c728:

GC / memtrace: memtrace doesn't work on OCaml 5 multicore (Gc.Memprof is unsupported). I used olly gc-stats (OCaml 5's runtime_events_tools) instead — GC overhead is 26% of CPU for eval, 43% for roundtrip. Full results in extra/PROFILING.md.

compare_val (3.1%): Identified via perf --symbol-filter. Called through both caml_compare (2%) and caml_equal (1%). Key sites:

typeloadCheck.ml:187 — | a, b when a = b on field_kind
typeloadCheck.ml:66,432 — MethDynamic / method kind comparisons
fields.ml:41 — !(a.a_status) = Const on anon_status
optimizerTexpr.ml:189 — a = b on constants
nullSafety.ml — 26 Hashtbl.find/mem calls using polymorphic hash+equality

Full list in extra/PROFILING.md §2.

HXB compression: Changed to Stored (level 0). Read-phase get bytes dropped from 0.090s → 0.021s (4.3×). Archive grows 3.6 MB → 6.5 MB.

write_type_instance: Already deduplicated within expressions via write_texpr_type_instance which serializes to bytes then interns via StringPool. At 0.38% of perf time, the gains from further deduplication are minimal — left as-is with explanation in the guide.

Domainslib: The ManagedPool already supports lazy acquire/release. The issue is that Domainslib.Task.setup_pool creates a fixed set of domains that spin-wait. Two approaches: (1) use Domain.spawn directly for short parallel sections like HXB export, replacing ManagedPool for those paths, or (2) add a module-count threshold (e.g. >50 modules) before acquiring the pool. Approach (1) eliminates spin-wait entirely but loses automatic work-stealing; approach (2) is simpler. Documented in extra/PROFILING.md §5.

Also noted re: OPAMYES=1 — will use that going forward.

Simn · 2026-03-15T09:36:53Z

@copilot

Please investigate the = situation because I don't think you're right about that. If the types are known at compile-time in OCaml, this should not lead to a polymorphic compare. Do some research and inspect the generated code to confirm.
Could you suggest a framework for using from Domainslib to Domain.spawn? I don't really have a vision for how to control the number of domains and how to actually interact with such a setup. We don't want to constantly set up and tear down domains in such cases, so this has to be managed somehow.

…y inspection; propose Domain.spawn framework Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Copilot · 2026-03-15T09:50:45Z

@copilot

Please investigate the = situation because I don't think you're right about that. If the types are known at compile-time in OCaml, this should not lead to a polymorphic...

Addressed in 7ed90ce and c8e7434.

= investigation: You were right — I verified by compiling test files with ocamlfind ocamlopt -S and inspecting the generated assembly. OCaml's native compiler specializes = in two cases:

Types where all constructors are constant (no args) — e.g. method_kind → emits cmpq
Comparing a variable against a literal constant constructor, even if the type has structured variants — e.g. anon_status = Const → emits cmpq $3, %rax

The sites I previously listed that are NOT polymorphic (corrected):

typeloadCheck.ml:66 — m1 = MethDynamic: method_kind all-constant → direct compare
typeloadCheck.ml:432 — mkind m1 = mkind m2: projects to int first → direct compare
fields.ml:41 — !(a.a_status) = Const: literal constant ctor → direct compare

Still truly polymorphic (calls caml_equal):

typeloadCheck.ml:187 — field_kind = field_kind (has Var of var_kind)
optimizerTexpr.ml:189 — tconstant = tconstant (has TInt, TString, etc.)
fields.ml:141,194 — e.eexpr = TConst TSuper (structured literal of huge variant type, but shallow)
Path comparisons — string list * string
nullSafety.ml — polymorphic Hashtbl with safety_subject keys (also contributes to the caml_hash 2.4%)

Full corrected analysis with assembly evidence in extra/PROFILING.md §2.

Domain.spawn framework: Added a concrete proposal in extra/PROFILING.md §5. The idea:

Replace the persistent Domainslib pool with Domain.spawn + Domain.join for each parallel section
All compiler parallel sections are simple data-parallel loops (array of modules/types) — no work-stealing or nested tasks needed
parallel_for ~num_domains length f partitions work into contiguous chunks, spawns N-1 domains, runs one chunk on the calling domain, then joins all
Add a min_parallel_items threshold (e.g. 32) to skip spawning for small arrays
ManagedPool.t is replaced by a simple num_domains : int in the server context
Zero idle CPU between parallel sections (no spin-wait)

Simn · 2026-03-15T10:00:38Z

@copilot I'm very curious about that parallel redesign. Go ahead and implement it, and make sure to measure the difference accurately. If this leads to the removal of Domainslib as a dependency, that'd be a bonus.

…ro idle CPU Replaces the Domainslib dependency with a custom WorkerPool that uses Domain.spawn for workers that block on Condition.wait between calls. Key design: - Workers sleep (zero CPU) when no parallel_for is active - Nested parallel_for calls detected via atomic busy flag, fall back to sequential (matches Domainslib's effective behavior) - Exception propagation with backtrace from worker domains - ManagedPool retained as lazy wrapper with same lifecycle Benchmark (4-core CI runner): Eval unit tests: 2762ms (Domainslib) → 2809ms (WorkerPool) [+1.7%, noise] JVM compilation: 1062ms (Domainslib) → 1070ms (WorkerPool) [+0.8%, noise] Removes domainslib from src/dune and haxe.opam. Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

…ughput Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Copilot · 2026-03-15T10:32:20Z

@copilot I'm very curious about that parallel redesign. Go ahead and implement it, and make sure to measure the difference accurately. If this leads to the removal of Domainslib as a dependency, that'...

Implemented in d9dd515. domainslib is fully removed from src/dune and haxe.opam.

Design: WorkerPool in src/context/parallel.ml spawns N-1 worker domains at pool creation. Workers block on Condition.wait between parallel_for calls (zero CPU). Work is distributed via contiguous chunk partitioning. A key detail: nested parallel_for calls (e.g. analyzer iterating types → iterating fields per type) are detected via an atomic busy flag and fall back to sequential, which prevents state corruption.

Benchmarks (4-core CI, median):

	Domainslib	WorkerPool
Eval	2762 ms	2809 ms (+1.7%)
JVM	1062 ms	1070 ms (+0.8%)

Within noise. The throughput is equivalent; the benefit is zero idle CPU (workers sleep instead of spin-waiting).

Initial plan

fe7b156

Copilot AI assigned Copilot and Simn Mar 15, 2026

Copilot started work on behalf of Simn March 15, 2026 08:08 View session

Add profile dune env for perf-instrumented builds; collect perf and -…

c22ce2a

…-times data Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add setup for low-level profiling of compiler~~ Add perf profiling infrastructure and document compiler hotspots Mar 15, 2026

Copilot AI requested a review from Simn March 15, 2026 09:01

Copilot stopped work on behalf of Simn due to an error March 15, 2026 09:01
The "Initialize opam and install OCaml 5.3.0" custom setup step from your .github/workflows/copilot-setup-steps.yml file failed.

Copilot started work on behalf of Simn March 15, 2026 09:07 View session

Switch HXB zip to Stored (level 0) compression; add profiling guide w…

930c728

…ith perf/olly/memtrace findings Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Copilot AI changed the title ~~Add perf profiling infrastructure and document compiler hotspots~~ Add perf profiling infrastructure, switch HXB to uncompressed storage, and document compiler hotspots Mar 15, 2026

Copilot finished work on behalf of Simn March 15, 2026 09:21

Copilot started work on behalf of Simn March 15, 2026 09:36 View session

Copilot AI and others added 2 commits March 15, 2026 09:49

Correct compare_val analysis: verify OCaml specialization via assembl…

7ed90ce

…y inspection; propose Domain.spawn framework Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Use safe array access in Domain.spawn example code

c8e7434

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Copilot finished work on behalf of Simn March 15, 2026 09:51

Copilot started work on behalf of Simn March 15, 2026 10:00 View session

Copilot AI and others added 2 commits March 15, 2026 10:30

Clarify benchmark wording: primary benefit is zero idle CPU, not thro…

b92a4b3

…ughput Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

Copilot AI changed the title ~~Add perf profiling infrastructure, switch HXB to uncompressed storage, and document compiler hotspots~~ Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots Mar 15, 2026

Copilot finished work on behalf of Simn March 15, 2026 10:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots#77

Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots#77
Copilot wants to merge 7 commits intosimn-developmentfrom
copilot/prepare-low-level-profiling-tools

Copilot AI commented Mar 15, 2026 •

edited

Loading

Uh oh!

Simn commented Mar 15, 2026

Uh oh!

Copilot AI commented Mar 15, 2026

Uh oh!

Simn commented Mar 15, 2026

Uh oh!

Copilot AI commented Mar 15, 2026

Uh oh!

Simn commented Mar 15, 2026

Uh oh!

Copilot AI commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Infrastructure changes

Code changes

Documentation

How to profile

Findings

Eval unit tests (--times + perf + olly, ~2.7 s)

compare_val analysis (3.1 % of perf time) — verified via assembly inspection

HXB roundtrip

GC overhead (olly gc-stats)

WorkerPool benchmark (4-core CI runner)

Remaining improvement opportunities

Uh oh!

Simn commented Mar 15, 2026

Uh oh!

Copilot AI commented Mar 15, 2026

Uh oh!

Simn commented Mar 15, 2026

Uh oh!

Copilot AI commented Mar 15, 2026

Uh oh!

Simn commented Mar 15, 2026

Uh oh!

Copilot AI commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Mar 15, 2026 •

edited

Loading

Eval unit tests (`--times` + `perf` + `olly`, ~2.7 s)

`compare_val` analysis (3.1 % of perf time) — verified via assembly inspection

GC overhead (`olly gc-stats`)